Spatial Model fitting in GLS

In this exercise we will fit a linear model using a Spatial structure as covariance matrix. We will use GLS to get better estimators.

As always we will need to load the necessary libraries.


In [5]:
# Load Biospytial modules and etc.
%matplotlib inline
import sys
sys.path.append('/apps')
sys.path.append('..')
sys.path.append('../spystats')
import django
django.setup()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
## Use the ggplot style
plt.style.use('ggplot')

import tools

Use this to automate the process. Be carefull it can overwrite current results

run ../HEC_runs/fit_fia_logbiomass_logspp_GLS.py /RawDataCSV/idiv_share/plotsClimateData_11092017.csv /apps/external_plugins/spystats/HEC_runs/results/logbiomas_logsppn_res.csv -85 -80 30 35

Importing data

We will use the FIA dataset and for exemplary purposes we will take a subsample of this data. Also important. The empirical variogram has been calculated for the entire data set using the residuals of an OLS model.

We will use some auxiliary functions defined in the fit_fia_logbiomass_logspp_GLS. You can inspect the functions using the ?? symbol.


In [ ]:


In [6]:
from HEC_runs.fit_fia_logbiomass_logspp_GLS import prepareDataFrame,loadVariogramFromData,buildSpatialStructure, calculateGLS, initAnalysis, fitGLSRobust

In [7]:
section = initAnalysis("/RawDataCSV/idiv_share/FIA_Plots_Biomass_11092017.csv",
                        "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",
                       -130,-60,30,40)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Reprojecting to Alberts equal area
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Removing possible duplicates. 
 This avoids problems of Non Positive semidefinite
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting OLS linear model: logBiomass ~ logSppN 
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Subselecting Region

In [ ]:
#section = initAnalysis("/RawDataCSV/idiv_share/plotsClimateData_11092017.csv",
#                        "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",
#                       -85,-80,30,35)

# IN HEC
#section = initAnalysis("/home/hpc/28/escamill/csv_data/idiv/FIA_Plots_Biomass_11092017.csv","/home/hpc/28/escamill/spystats/HEC_runs/results/variogram/data_envelope.csv",-85,-80,30,35)

In [8]:
section.shape


Out[8]:
(18414, 22)

Now we will obtain the data from the calculated empirical variogram.


In [9]:
gvg,tt = loadVariogramFromData("/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",section)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Reading the empirical Variogram file
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Instantiating a Variogram object with the values calculated before
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Dropping possible Nans
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Instantiating Model...
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:fitting whittle Model with the empirical variogram
../spystats/tools.py:625: RuntimeWarning: divide by zero encountered in power
  g_h = ((sill - nugget)*(1 - np.exp(-(h**alpha / range_a**alpha)))) + nugget*Ih
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Model fitted

In [10]:
gvg.plot(refresh=False,with_envelope=True)



In [11]:
resum,gvgn,resultspd,results = fitGLSRobust(section,gvg,num_iterations=10,distance_threshold=1000000)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.898310288151, {{"Intercept":8.4466941208,"logSppN":0.3949644501},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.4204865441,"logSppN":0.380414251},"1":{"Intercept":8.4729016975,"logSppN":0.4095146491}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
../spystats/tools.py:625: RuntimeWarning: invalid value encountered in double_scalars
  g_h = ((sill - nugget)*(1 - np.exp(-(h**alpha / range_a**alpha)))) + nugget*Ih
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 114072.405376, sill 0.329197277814, nugget 0.313620600222
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.757184811666, {{"Intercept":8.4268688196,"logSppN":0.3933313397},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3660482966,"logSppN":0.377323159},"1":{"Intercept":8.4876893426,"logSppN":0.4093395204}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115372.521709, sill 0.329234784092, nugget 0.313589255985
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753402535991, {{"Intercept":8.4270643468,"logSppN":0.3933083108},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654998811,"logSppN":0.3772973117},"1":{"Intercept":8.4886288125,"logSppN":0.4093193098}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115394.979463, sill 0.329235474303, nugget 0.313588820165
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753335552391, {{"Intercept":8.4270681443,"logSppN":0.3933079549},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654905015,"logSppN":0.3772969087},"1":{"Intercept":8.4886457871,"logSppN":0.409319001}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115396.886797, sill 0.329235547172, nugget 0.313588811779
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753328938987, {{"Intercept":8.427068631,"logSppN":0.3933079362},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654896937,"logSppN":0.3772968861},"1":{"Intercept":8.4886475684,"logSppN":0.4093189864}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115397.517134, sill 0.329235570461, nugget 0.313588812584
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753326861106, {{"Intercept":8.427068785,"logSppN":0.3933079309},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654894408,"logSppN":0.3772968794},"1":{"Intercept":8.4886481292,"logSppN":0.4093189823}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115397.969008, sill 0.329235581842, nugget 0.313588819209
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753325897057, {{"Intercept":8.4270688417,"logSppN":0.3933079273},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654893072,"logSppN":0.3772968752},"1":{"Intercept":8.4886483762,"logSppN":0.4093189794}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115397.941371, sill 0.329235586045, nugget 0.313588813343
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753325472221, {{"Intercept":8.4270688879,"logSppN":0.3933079273},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654892717,"logSppN":0.377296875},"1":{"Intercept":8.4886485041,"logSppN":0.4093189796}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115397.941371, sill 0.329235586045, nugget 0.313588813343
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753325472221, {{"Intercept":8.4270688879,"logSppN":0.3933079273},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654892717,"logSppN":0.377296875},"1":{"Intercept":8.4886485041,"logSppN":0.4093189796}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115397.941371, sill 0.329235586045, nugget 0.313588813343
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.753325472221, {{"Intercept":8.4270688879,"logSppN":0.3933079273},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.3654892717,"logSppN":0.377296875},"1":{"Intercept":8.4886485041,"logSppN":0.4093189796}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 115397.941371, sill 0.329235586045, nugget 0.313588813343

In [46]:
resum.as_text


Out[46]:
<bound method Summary.as_text of <class 'statsmodels.iolib.summary.Summary'>
"""
                            GLS Regression Results                            
==============================================================================
Dep. Variable:             logBiomass   R-squared:                       0.753
Model:                            GLS   Adj. R-squared:                  0.753
Method:                 Least Squares   F-statistic:                 5.623e+04
Date:                Sat, 03 Mar 2018   Prob (F-statistic):               0.00
Time:                        00:21:06   Log-Likelihood:                -16502.
No. Observations:               18414   AIC:                         3.301e+04
Df Residuals:                   18412   BIC:                         3.302e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      8.4271      0.031    268.235      0.000         8.365     8.489
logSppN        0.3933      0.008     48.149      0.000         0.377     0.409
==============================================================================
Omnibus:                     1129.018   Durbin-Watson:                   1.973
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1762.135
Skew:                          -0.509   Prob(JB):                         0.00
Kurtosis:                       4.122   Cond. No.                         4.10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
""">

In [25]:
plt.plot(resultspd.rsq)
plt.title("GLS feedback algorithm")
plt.xlabel("Number of iterations")
plt.ylabel("R-sq fitness estimator")


Out[25]:
<matplotlib.text.Text at 0x7f212f0c9850>

In [26]:
resultspd.columns


Out[26]:
Index([u'conf_int', u'params', u'pvals', u'rsq'], dtype='object')

In [34]:
a = map(lambda x : x.to_dict(), resultspd['params'])

In [36]:
paramsd = pd.DataFrame(a)

In [38]:
paramsd


Out[38]:
Intercept logSppN
0 8.446694 0.394964
1 8.426869 0.393331
2 8.427064 0.393308
3 8.427068 0.393308
4 8.427069 0.393308
5 8.427069 0.393308
6 8.427069 0.393308
7 8.427069 0.393308
8 8.427069 0.393308
9 8.427069 0.393308

In [44]:
plt.plot(paramsd.Intercept.loc[1:])
plt.get_yaxis().get_major_formatter().set_useOffset(False)



AttributeErrorTraceback (most recent call last)
<ipython-input-44-8a4331454875> in <module>()
      1 plt.plot(paramsd.Intercept.loc[1:])
----> 2 plt.get_yaxis().get_major_formatter().set_useOffset(False)

AttributeError: 'module' object has no attribute 'get_yaxis'

In [45]:
fig = plt.figure(figsize=(10,10))
plt.plot(paramsd.logSppN.iloc[1:])


Out[45]:
[<matplotlib.lines.Line2D at 0x7f212eeec190>]

In [ ]:
variogram_data_path = "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv"
thrs_dist = 100000
emp_var_log_log = pd.read_csv(variogram_data_path)

Instantiating the variogram object


In [ ]:
gvg = tools.Variogram(section,'logBiomass',using_distance_threshold=thrs_dist)
gvg.envelope = emp_var_log_log
gvg.empirical = emp_var_log_log.variogram
gvg.lags = emp_var_log_log.lags
#emp_var_log_log = emp_var_log_log.dropna()
#vdata = gvg.envelope.dropna()

Instantiating theoretical variogram model


In [ ]:
matern_model = tools.MaternVariogram(sill=0.34,range_a=100000,nugget=0.33,kappa=4)
whittle_model = tools.WhittleVariogram(sill=0.34,range_a=100000,nugget=0.0,alpha=3)
exp_model = tools.ExponentialVariogram(sill=0.34,range_a=100000,nugget=0.33)
gaussian_model = tools.GaussianVariogram(sill=0.34,range_a=100000,nugget=0.33)
spherical_model = tools.SphericalVariogram(sill=0.34,range_a=100000,nugget=0.33)

In [ ]:
gvg.model = whittle_model
#gvg.model = matern_model
#models = map(lambda model : gvg.fitVariogramModel(model),[matern_model,whittle_model,exp_model,gaussian_model,spherical_model])

gvg.fitVariogramModel(whittle_model)

In [ ]:
import numpy as np
xx = np.linspace(0,1000000,1000)

gvg.plot(refresh=False,with_envelope=True)
plt.plot(xx,whittle_model.f(xx),lw=2.0,c='k')
plt.title("Empirical Variogram with fitted Whittle Model")
FOr the sake of brevity I didn't include the steps for calculating the covariance matrix and include it in the GLS routine. These are the results after fitting. """ GLS Regression Results ============================================================================== Dep. Variable: logBiomass R-squared: 0.900 Model: GLS Adj. R-squared: 0.900 Method: Least Squares F-statistic: 3.304e+05 Date: Sun, 28 Jan 2018 Prob (F-statistic): 0.00 Time: 12:48:52 Log-Likelihood: -34662. No. Observations: 36868 AIC: 6.933e+04 Df Residuals: 36866 BIC: 6.935e+04 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 8.4705 0.010 840.884 0.000 8.451 8.490 logSppN 0.3909 0.006 66.280 0.000 0.379 0.402 ============================================================================== Omnibus: 1933.532 Durbin-Watson: 1.906 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2819.569 Skew: -0.475 Prob(JB): 0.00 Kurtosis: 3.966 Cond. No. 3.75 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. """

In [ ]:
def randomSelection(n,p):
    idxs = np.random.choice(n,p,replace=False)
    random_sample = new_data.iloc[idxs]
    return random_sample
#################
n = len(new_data)
p = 3000 # The amount of samples taken (let's do it without replacement)

In [ ]:
random_sample = randomSelection(n,100)